Take-Home Exercise 2

Take-home Exercise 2: Applied Spatial Interaction Models: A case study of Singapore public bus commuter flows

Motivation and Objective

This take-home exercise is motivated by two main reasons. Firstly, despite increasing amounts of open data available for public consumption, there has not been significant practice research carried out to show how these disparate data sources can be integrated, analysed, and modelled to support policy making decisions.

Secondly, there is a general lack of practical research to show how geospatial data science and analysis (GDSA) can be used to support decision-making.

Hence, your task for this take-home exercise is to conduct a case study to demonstrate the potential value of GDSA to integrate publicly available data from multiple sources for building a spatial interaction models to determine factors affecting urban mobility patterns of public bus transit.

The Data

Open Government Data

For the purpose of this assignment, data from several open government sources will be used:

  • Passenger Volume by Origin Destination Bus Stops, Bus Stop Location, Train Station and Train Station Exit Point, just to name a few of them, from LTA DataMall.

  • Master Plan 2019 Subzone Boundary, HDB Property Information, School Directory and Information and other relevant data from Data.gov.sg.

Specially collected data

  • Businesses, retail and services, leisure and recreation, etc geospatial data sets assemble by course instructor. (Refer to eLearn)

The Task

Geospatial Data Science

  • Derive an analytical hexagon data of 325m (this distance is the perpendicular distance between the centre of the hexagon and its edges) to represent the traffic analysis zone (TAZ).

  • With reference to the time intervals provided in the table below, construct an O-D matrix of commuter flows for a time interval of your choice by integrating Passenger Volume by Origin Destination Bus Stops and Bus Stop Location from LTA DataMall. The O-D matrix must be aggregated at the analytics hexagon level

    Peak hour period Bus tap on time
    Weekday morning peak 6am to 9am
    Weekday afternoon peak 5pm to 8pm
    Weekend/holiday morning peak 11am to 2pm
    Weekend/holiday evening peak 4pm to 7pm
  • Display the O-D flows of the passenger trips by using appropriate geovisualisation methods (not more than 5 maps).

  • Describe the spatial patterns revealed by the geovisualisation (not more than 100 words per visual).

  • Assemble at least three propulsive and three attractiveness variables by using aspatial and geospatial from publicly available sources.

  • Compute a distance matrix by using the analytical hexagon data derived earlier.

Spatial Interaction Modelling

  • Calibrate spatial interactive models to determine factors affecting urban commuting flows at the selected time interval.

  • Present the modelling results by using appropriate geovisualisation and graphical visualisation methods. (Not more than 5 visuals)

  • With reference to the Spatial Interaction Model output tables, maps and data visualisation prepared, describe the modelling results. (not more than 100 words per visual).

FIRST STEP

Load in necessary packages:

pacman::p_load(tmap, sf, sp, DT, stplanr,
               performance, reshape2,
               ggpubr, tidyverse)

As we are aware, there is an increasing amounts of open data available, but there has not been significant practice research carried out to show how these disparate data sources can be integrated, analysed, and modelled to support policy making decisions. In this section I will be performing integration of various data sources.

I will first check the various database selected to see how is it possible to build a hollistic database.

Data Import, Extraction, Processing

Geospatial Data - Bus Stop

BusStop <- st_read(dsn="data/geospatial",
                  layer="BusStop")%>%
  st_transform(crs = 3414)
Reading layer `BusStop' from data source 
  `C:\cftoh\ISSS624\Take-home_Ex\Take-home_Ex2\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 5161 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 3970.122 ymin: 26482.1 xmax: 48284.56 ymax: 52983.82
Projected CRS: SVY21

I see that BusStop is a point geometry SF with “BUS_STOP_N”, “BUS_ROOF_N”, “LOC_DESC” and its geometry.

MPSZ <- st_read(dsn="data/geospatial",                   
                layer="MPSZ-2019")%>%   
  st_transform(crs = 3414)
Reading layer `MPSZ-2019' from data source 
  `C:\cftoh\ISSS624\Take-home_Ex\Take-home_Ex2\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 332 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 103.6057 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
Geodetic CRS:  WGS 84

I see that MPSZ is a Multipolygon geometry SF with “SUBZONE_N”, “SUBZONE_C”, “PLN_AREA_N”, “PLN_AREA_C”, “REGION_N”, “REGION_C” and its geometry.

Creating Hexagon

Traffic Analysis Zone (TAZ)

Traffic analysis zones are universally used in travel demand modelling to represent the spatial distribution of trip1 origins and destinations, as well as the population, employment and other spatial attributes that generate or otherwise influence travel demand. The urban area is divided into a set of mutually exclusive and collectively exhaustive zones. While travel actually occurs from one point in the urban region to another, all trip origins and destinations in a travel demand model are represented at the spatially aggregate level of the movement from an origin zone to a destination zone. These movements are further aggregated within network assignment models as originating and ending at single points within the origin and destination zones – the zone centroids.

As we should derive an analytical hexagon data of 325m (this distance is the perpendicular distance between the centre of the hexagon and its edges) to represent the TAZ,

Per the documentation of st_make_grid:

cellsize is numeric of length 1 or 2 with target cellsize: ..for hexagonal cells the distance between opposite edges (edge length is cellsize/sqrt(3)). A length units object can be passed, or an area unit object with area size of the square or hexagonal cell.

In this case, the distance between opposite edges should be 325m * 2. And the length of hexagon should be 325m.

BusStop_hexagon_grid = st_make_grid(BusStop, 325, what = "polygons", square = FALSE)

BusStop_hexagon_sf = st_sf(geometry = BusStop_hexagon_grid) %>%
  # add grid ID
  mutate(grid_id = 1:length(lengths(BusStop_hexagon_grid)))

We will next use st_intersection() for point and polygon overlay, to combine the data sets. This will provide us with output in point sf object.

BusStop_hexagon <- st_intersection(BusStop_hexagon_sf, BusStop) %>%   
  select(1,2,4) %>%   
  st_drop_geometry()
Warning: attribute variables are assumed to be spatially constant throughout
all geometries
MPSZ_hexagon <- st_intersection(BusStop_hexagon_sf, MPSZ)  %>%  
  select(1:3) 
Warning: attribute variables are assumed to be spatially constant throughout
all geometries
BusStop_MPSZ_hexagon <- left_join(BusStop_hexagon,MPSZ_hexagon,
                                  by = "grid_id")
Warning in left_join(BusStop_hexagon, MPSZ_hexagon, by = "grid_id"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 2286 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.

Note this BusStop_MPSZ_hexagon only contain the hexagon which has got bus stop within the area.

write_rds(BusStop_hexagon, "data/rds/BusStop_hexagon.rds")

Performing the Relational Join (to update attribute table of one geospatial with another aspatial data set)

Now we will next combine this onto our our odbs_peak data frame which shows the total number of trips from particular bus stop during peak hour.

Aspatial Data

Next we import the Aspatial Data of PASSENGER VOLUME BY ORIGIN DESTINATION BUS STATIONS, downloaded via API (postman GET) from Data Mall LTA. For the purpose of this exercise the Aug 2023 Data will be used.

OD_bus <- read_csv("data/aspatial/origin_destination_bus_202308.csv")
Rows: 5709512 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): YEAR_MONTH, DAY_TYPE, PT_TYPE, ORIGIN_PT_CODE, DESTINATION_PT_CODE
dbl (2): TIME_PER_HOUR, TOTAL_TRIPS

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#non-spatial data with no geometry features
OD_bus$ORIGIN_PT_CODE <- as.factor(OD_bus$ORIGIN_PT_CODE)
OD_bus$DESTINATION_PT_CODE <- as.factor(OD_bus$DESTINATION_PT_CODE) 

Noted that the aspatial data for bus has total of 7 columns YEAR_MONTH, DAY_TYPE, PT_TYPE, ORIGIN_PT_CODE, DESTINATION_PT_CODE, TIME_PER_HOUR, TOTAL_TRIPS.

For the purpose of this exercise, we only look at those Bus tap on time on Weekday morning peak, between 6am to 9am.

OD_bus_peak <- OD_bus %>% 
  filter(DAY_TYPE == "WEEKDAY" & 
           (TIME_PER_HOUR >=6 & TIME_PER_HOUR <=9))  
summary(OD_bus_peak)
  YEAR_MONTH          DAY_TYPE         TIME_PER_HOUR     PT_TYPE         
 Length:669211      Length:669211      Min.   :6.000   Length:669211     
 Class :character   Class :character   1st Qu.:7.000   Class :character  
 Mode  :character   Mode  :character   Median :8.000   Mode  :character  
                                       Mean   :7.595                     
                                       3rd Qu.:9.000                     
                                       Max.   :9.000                     
                                                                         
 ORIGIN_PT_CODE   DESTINATION_PT_CODE  TOTAL_TRIPS      
 22009  :  1925   22009  :  1803      Min.   :    1.00  
 84009  :  1817   52009  :  1785      1st Qu.:    2.00  
 75009  :  1757   84009  :  1732      Median :    7.00  
 52009  :  1678   75009  :  1658      Mean   :   39.49  
 59009  :  1656   59009  :  1534      3rd Qu.:   24.00  
 46009  :  1535   46009  :  1506      Max.   :29151.00  
 (Other):658843   (Other):659193                        

Computing the distance matrix

We use the spDists() coz faster.

BusStop_hexagon_sp <- as(BusStop_hexagon_sf, "Spatial")
dist <- spDists(BusStop_hexagon_sp,                  
                longlat = FALSE)  
#because or subzone is already svr21. Otherwise it will treat data as x and y, then calculate great circle distance.  
head(dist, n=c(10, 10)) #only list first 10 col and 10 rows
           [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
 [1,]    0.0000  562.9165 1125.8330 1688.7495 2251.6660 2814.5826 3377.4991
 [2,]  562.9165    0.0000  562.9165 1125.8330 1688.7495 2251.6660 2814.5826
 [3,] 1125.8330  562.9165    0.0000  562.9165 1125.8330 1688.7495 2251.6660
 [4,] 1688.7495 1125.8330  562.9165    0.0000  562.9165 1125.8330 1688.7495
 [5,] 2251.6660 1688.7495 1125.8330  562.9165    0.0000  562.9165 1125.8330
 [6,] 2814.5826 2251.6660 1688.7495 1125.8330  562.9165    0.0000  562.9165
 [7,] 3377.4991 2814.5826 2251.6660 1688.7495 1125.8330  562.9165    0.0000
 [8,] 3940.4156 3377.4991 2814.5826 2251.6660 1688.7495 1125.8330  562.9165
 [9,] 4503.3321 3940.4156 3377.4991 2814.5826 2251.6660 1688.7495 1125.8330
[10,] 5066.2486 4503.3321 3940.4156 3377.4991 2814.5826 2251.6660 1688.7495
           [,8]      [,9]     [,10]
 [1,] 3940.4156 4503.3321 5066.2486
 [2,] 3377.4991 3940.4156 4503.3321
 [3,] 2814.5826 3377.4991 3940.4156
 [4,] 2251.6660 2814.5826 3377.4991
 [5,] 1688.7495 2251.6660 2814.5826
 [6,] 1125.8330 1688.7495 2251.6660
 [7,]  562.9165 1125.8330 1688.7495
 [8,]    0.0000  562.9165 1125.8330
 [9,]  562.9165    0.0000  562.9165
[10,] 1125.8330  562.9165    0.0000

Notice that the output dist is a matrix object class of R. Also notice that the column heanders and row headers are not labeled with the planning subzone codes.

Labelling column and row headers of a distance matrix

First, we will create a list sorted according to the the distance matrix by planning sub-zone code.

grid <- BusStop_hexagon_sf$grid_id

Next we will attach SUBZONE_C to row and column for distance matrix matching ahead

colnames(dist) <- paste0(grid) 
rownames(dist) <- paste0(grid)

Pivoting distance value by SUBZONE_C

Next, we will pivot the distance matrix into a long table by using the row and column subzone codes as show in the code chunk below.

distPair <- melt(dist) %>% 
  rename(dist = value)
head(distPair, 10)
   Var1 Var2      dist
1     1    1    0.0000
2     2    1  562.9165
3     3    1 1125.8330
4     4    1 1688.7495
5     5    1 2251.6660
6     6    1 2814.5826
7     7    1 3377.4991
8     8    1 3940.4156
9     9    1 4503.3321
10   10    1 5066.2486

Note that melt() is a old reshape tool function, that take dist matrix and convert it to long table, 1. origin, 2. destination, 3. distance matrix

Notice that the within zone distance is 0.

Updating intra-zonal distances

Now I am going to append a constant value to replace the intra-zonal distance of 0, first select and find out the minimum value of the distance by using summary().

distPair <- distPair %>%
  rename(orig = Var1,
         dest = Var2)
distPair %>%   
  filter(dist > 0) %>%   
  summary()
      orig            dest            dist      
 Min.   :    1   Min.   :    1   Min.   :  325  
 1st Qu.: 3301   1st Qu.: 3301   1st Qu.:11500  
 Median : 6600   Median : 6600   Median :18086  
 Mean   : 6600   Mean   : 6600   Mean   :18991  
 3rd Qu.: 9900   3rd Qu.: 9900   3rd Qu.:25526  
 Max.   :13200   Max.   :13200   Max.   :51798  
write_rds(distPair, "data/rds/distPair.rds") 

Preparing flow data

Note that although I would want to keep all the fields intact as I wasnt sure what are the data I need for the next steps. It is taking too much processing power and space, so i will summarize and aggregate the value of selected time.

BusStop_Trips <- left_join(OD_bus_peak, BusStop_hexagon, #the left join is so to get grid ID from hexagon file
                      by = c("ORIGIN_PT_CODE" = "BUS_STOP_N")) %>%
  drop_na() %>%
  rename(ORIGIN_BS = ORIGIN_PT_CODE,
         ORIGIN_GRID = grid_id,
         DESTIN_BS = DESTINATION_PT_CODE
         )%>%
  group_by(ORIGIN_GRID, ORIGIN_BS, DESTIN_BS) %>% 
  summarize(TRIPS = sum(TOTAL_TRIPS))
Warning in left_join(OD_bus_peak, BusStop_hexagon, by = c(ORIGIN_PT_CODE = "BUS_STOP_N")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 490 of `x` matches multiple rows in `y`.
ℹ Row 4426 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
`summarise()` has grouped output by 'ORIGIN_GRID', 'ORIGIN_BS'. You can
override using the `.groups` argument.

But there isn’t grid ID for my destination so i will now do the following to get complete picture:

BusStop_Trips <- BusStop_Trips %>%
  left_join(BusStop_hexagon, #the left join is so to get grid ID from hexagon file
                      by = c("DESTIN_BS" = "BUS_STOP_N")) %>%
  drop_na() %>%
    rename(DESTIN_GRID = grid_id)%>%
  select(1,2,5,3,4)
Warning in left_join(., BusStop_hexagon, by = c(DESTIN_BS = "BUS_STOP_N")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 2075 of `x` matches multiple rows in `y`.
ℹ Row 2732 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
BusStop_Trips <- unique(BusStop_Trips)

Visualising Spatial Interaction

Separating intra-flow from passenger volume df

To add three new fields in BusStop_Trips dataframe, that is to be used for flow.

str(BusStop_Trips)
gropd_df [236,683 × 5] (S3: grouped_df/tbl_df/tbl/data.frame)
 $ ORIGIN_GRID: int [1:236683] 52 52 52 52 52 52 149 149 149 200 ...
 $ ORIGIN_BS  : chr [1:236683] "25059" "25059" "25059" "25059" ...
 $ DESTIN_GRID: int [1:236683] 783 345 295 580 630 735 783 345 735 2079 ...
 $ DESTIN_BS  : chr [1:236683] "24719" "25709" "25719" "25771" ...
 $ TRIPS      : num [1:236683] 54 2 1 2 1 2 34 13 3 2 ...
 - attr(*, "groups")= tibble [4,962 × 3] (S3: tbl_df/tbl/data.frame)
  ..$ ORIGIN_GRID: int [1:4962] 52 149 200 201 245 295 295 295 298 342 ...
  ..$ ORIGIN_BS  : chr [1:4962] "25059" "25751" "26379" "26369" ...
  ..$ .rows      : list<int> [1:4962] 
  .. ..$ : int [1:6] 1 2 3 4 5 6
  .. ..$ : int [1:3] 7 8 9
  .. ..$ : int [1:5] 10 11 12 13 14
  .. ..$ : int [1:8] 15 16 17 18 19 20 21 22
  .. ..$ : int [1:9] 23 24 25 26 27 28 29 30 31
  .. ..$ : int [1:8] 32 33 34 35 36 37 38 39
  .. ..$ : int [1:10] 40 41 42 43 44 45 46 47 48 49
  .. ..$ : int [1:14] 50 51 52 53 54 55 56 57 58 59 ...
  .. ..$ : int [1:10] 64 65 66 67 68 69 70 71 72 73
  .. ..$ : int [1:7] 74 75 76 77 78 79 80
  .. ..$ : int [1:6] 81 82 83 84 85 86
  .. ..$ : int [1:7] 87 88 89 90 91 92 93
  .. ..$ : int [1:11] 94 95 96 97 98 99 100 101 102 103 ...
  .. ..$ : int [1:16] 105 106 107 108 109 110 111 112 113 114 ...
  .. ..$ : int [1:7] 121 122 123 124 125 126 127
  .. ..$ : int [1:12] 128 129 130 131 132 133 134 135 136 137 ...
  .. ..$ : int [1:17] 140 141 142 143 144 145 146 147 148 149 ...
  .. ..$ : int [1:8] 157 158 159 160 161 162 163 164
  .. ..$ : int [1:12] 165 166 167 168 169 170 171 172 173 174 ...
  .. ..$ : int [1:8] 177 178 179 180 181 182 183 184
  .. ..$ : int [1:14] 185 186 187 188 189 190 191 192 193 194 ...
  .. ..$ : int [1:9] 199 200 201 202 203 204 205 206 207
  .. ..$ : int [1:4] 208 209 210 211
  .. ..$ : int [1:9] 212 213 214 215 216 217 218 219 220
  .. ..$ : int [1:7] 221 222 223 224 225 226 227
  .. ..$ : int [1:5] 228 229 230 231 232
  .. ..$ : int [1:7] 233 234 235 236 237 238 239
  .. ..$ : int [1:7] 240 241 242 243 244 245 246
  .. ..$ : int 247
  .. ..$ : int [1:4] 248 249 250 251
  .. ..$ : int [1:5] 252 253 254 255 256
  .. ..$ : int [1:5] 257 258 259 260 261
  .. ..$ : int [1:20] 262 263 264 265 266 267 268 269 270 271 ...
  .. ..$ : int [1:4] 282 283 284 285
  .. ..$ : int [1:13] 286 287 288 289 290 291 292 293 294 295 ...
  .. ..$ : int [1:4] 299 300 301 302
  .. ..$ : int [1:7] 303 304 305 306 307 308 309
  .. ..$ : int [1:8] 310 311 312 313 314 315 316 317
  .. ..$ : int [1:8] 318 319 320 321 322 323 324 325
  .. ..$ : int [1:4] 326 327 328 329
  .. ..$ : int [1:14] 330 331 332 333 334 335 336 337 338 339 ...
  .. ..$ : int [1:6] 344 345 346 347 348 349
  .. ..$ : int [1:6] 350 351 352 353 354 355
  .. ..$ : int [1:5] 356 357 358 359 360
  .. ..$ : int [1:10] 361 362 363 364 365 366 367 368 369 370
  .. ..$ : int [1:6] 371 372 373 374 375 376
  .. ..$ : int [1:6] 377 378 379 380 381 382
  .. ..$ : int [1:4] 383 384 385 386
  .. ..$ : int [1:16] 387 388 389 390 391 392 393 394 395 396 ...
  .. ..$ : int [1:5] 403 404 405 406 407
  .. ..$ : int [1:9] 408 409 410 411 412 413 414 415 416
  .. ..$ : int [1:7] 417 418 419 420 421 422 423
  .. ..$ : int [1:5] 424 425 426 427 428
  .. ..$ : int [1:3] 429 430 431
  .. ..$ : int [1:6] 432 433 434 435 436 437
  .. ..$ : int [1:8] 438 439 440 441 442 443 444 445
  .. ..$ : int [1:2] 446 447
  .. ..$ : int [1:9] 448 449 450 451 452 453 454 455 456
  .. ..$ : int [1:3] 457 458 459
  .. ..$ : int [1:4] 460 461 462 463
  .. ..$ : int [1:4] 464 465 466 467
  .. ..$ : int [1:5] 468 469 470 471 472
  .. ..$ : int [1:4] 473 474 475 476
  .. ..$ : int [1:32] 477 478 479 480 481 482 483 484 485 486 ...
  .. ..$ : int [1:5] 509 510 511 512 513
  .. ..$ : int [1:42] 514 515 516 517 518 519 520 521 522 523 ...
  .. ..$ : int [1:8] 556 557 558 559 560 561 562 563
  .. ..$ : int [1:7] 564 565 566 567 568 569 570
  .. ..$ : int [1:5] 571 572 573 574 575
  .. ..$ : int [1:5] 576 577 578 579 580
  .. ..$ : int [1:3] 581 582 583
  .. ..$ : int [1:3] 584 585 586
  .. ..$ : int [1:17] 587 588 589 590 591 592 593 594 595 596 ...
  .. ..$ : int [1:3] 604 605 606
  .. ..$ : int [1:106] 607 608 609 610 611 612 613 614 615 616 ...
  .. ..$ : int [1:7] 713 714 715 716 717 718 719
  .. ..$ : int [1:12] 720 721 722 723 724 725 726 727 728 729 ...
  .. ..$ : int 732
  .. ..$ : int [1:3] 733 734 735
  .. ..$ : int [1:6] 736 737 738 739 740 741
  .. ..$ : int [1:3] 742 743 744
  .. ..$ : int [1:25] 745 746 747 748 749 750 751 752 753 754 ...
  .. ..$ : int [1:9] 770 771 772 773 774 775 776 777 778
  .. ..$ : int [1:34] 779 780 781 782 783 784 785 786 787 788 ...
  .. ..$ : int [1:10] 813 814 815 816 817 818 819 820 821 822
  .. ..$ : int 823
  .. ..$ : int [1:41] 824 825 826 827 828 829 830 831 832 833 ...
  .. ..$ : int [1:4] 865 866 867 868
  .. ..$ : int [1:11] 869 870 871 872 873 874 875 876 877 878 ...
  .. ..$ : int [1:5] 880 881 882 883 884
  .. ..$ : int [1:2] 885 886
  .. ..$ : int [1:2] 887 888
  .. ..$ : int [1:9] 889 890 891 892 893 894 895 896 897
  .. ..$ : int [1:4] 898 899 900 901
  .. ..$ : int 902
  .. ..$ : int [1:3] 903 904 905
  .. ..$ : int [1:10] 906 907 908 909 910 911 912 913 914 915
  .. ..$ : int [1:3] 916 917 918
  .. ..$ : int [1:6] 919 920 921 922 923 924
  .. .. [list output truncated]
  .. ..@ ptype: int(0) 
  ..- attr(*, ".drop")= logi TRUE

This above checks show that my ORIGIN_BS column is chr, therefore i will now convert it to factor:

BusStop_Trips$ORIGIN_BS <- as.factor(BusStop_Trips$ORIGIN_BS)
BusStop_Trips$FlowNoIntra <- ifelse(
  BusStop_Trips$ORIGIN_GRID == BusStop_Trips$DESTIN_GRID, 
  0, BusStop_Trips$TRIPS)
BusStop_Trips$offset <- ifelse(
  BusStop_Trips$ORIGIN_GRID == BusStop_Trips$DESTIN_GRID, 
  0.000001, 1)

Combining passenger volume data with distance value

BusStop_Trips1 <- BusStop_Trips %>%
  left_join (distPair,
             by = c("ORIGIN_GRID" = "orig",
                    "DESTIN_GRID" = "dest"))
summary(BusStop_Trips1)
  ORIGIN_GRID      ORIGIN_BS       DESTIN_GRID     DESTIN_BS        
 Min.   :   52   22009  :   591   Min.   :   52   Length:236683     
 1st Qu.: 5846   84009  :   564   1st Qu.: 5956   Class :character  
 Median : 7637   75009  :   539   Median : 7602   Mode  :character  
 Mean   : 7473   52009  :   538   Mean   : 7457                     
 3rd Qu.: 9087   64009  :   451   3rd Qu.: 9035                     
 Max.   :13117   46009  :   444   Max.   :13117                     
                 (Other):233556                                     
     TRIPS          FlowNoIntra          offset              dist      
 Min.   :    1.0   Min.   :    0.0   Min.   :0.000001   Min.   :    0  
 1st Qu.:    3.0   1st Qu.:    3.0   1st Qu.:1.000000   1st Qu.: 1810  
 Median :   15.0   Median :   15.0   Median :1.000000   Median : 3748  
 Mean   :  109.5   Mean   :  109.4   Mean   :0.995475   Mean   : 4865  
 3rd Qu.:   58.0   3rd Qu.:   58.0   3rd Qu.:1.000000   3rd Qu.: 6917  
 Max.   :96630.0   Max.   :96630.0   Max.   :1.000000   Max.   :24758  
                                                                       

From the summary above, it is noted that there are some origin and destination bus stops within the same hexagon grid. For the purpose of this exercise we will exclude them, as the aim is to find out the interozonal (hexagon) flow.

BusStop_Trips2 <- BusStop_Trips1 %>%
  filter(dist>0)
summary(BusStop_Trips2)
  ORIGIN_GRID      ORIGIN_BS       DESTIN_GRID     DESTIN_BS        
 Min.   :   52   22009  :   591   Min.   :   52   Length:235612     
 1st Qu.: 5866   84009  :   563   1st Qu.: 5956   Class :character  
 Median : 7639   75009  :   539   Median : 7602   Mode  :character  
 Mean   : 7475   52009  :   538   Mean   : 7459                     
 3rd Qu.: 9087   64009  :   451   3rd Qu.: 9035                     
 Max.   :13117   46009  :   443   Max.   :13117                     
                 (Other):232487                                     
     TRIPS          FlowNoIntra          offset       dist      
 Min.   :    1.0   Min.   :    1.0   Min.   :1   Min.   :  325  
 1st Qu.:    3.0   1st Qu.:    3.0   1st Qu.:1   1st Qu.: 1810  
 Median :   15.0   Median :   15.0   Median :1   Median : 3748  
 Mean   :  109.9   Mean   :  109.9   Mean   :1   Mean   : 4887  
 3rd Qu.:   59.0   3rd Qu.:   59.0   3rd Qu.:1   3rd Qu.: 6917  
 Max.   :96630.0   Max.   :96630.0   Max.   :1   Max.   :24758  
                                                                

Now we see the minimum is 325, which is the the perpendicular distance between the centre of the hexagon and its edges (next hexagon).

Visualization for the TOTAL TRIPS taken at Origin and Destination hexagon area

origin_density_map <- left_join(BusStop_hexagon_sf, BusStop_Trips1,
                                by =c("grid_id" = "ORIGIN_GRID")) %>%
  drop_na() %>%
  rename(ORIGIN_GRID = grid_id) %>%
  group_by(ORIGIN_GRID) %>% 
  summarize(TOTAL_TRIPS = sum(TRIPS), AVERAGE_DIST = weighted.mean(dist, w = TRIPS))

destin_density_map <- left_join(BusStop_hexagon_sf, BusStop_Trips1,
                                by =c("grid_id" = "DESTIN_GRID")) %>%
  drop_na() %>%
  rename(DESTIN_GRID = grid_id) %>%
  group_by(DESTIN_GRID) %>% 
  summarize(TOTAL_TRIPS = sum(TRIPS), AVERAGE_DIST = weighted.mean(dist, w = TRIPS))

We will just narrow down to look at number of trips above 100k during this period.

tmap_mode("view")
tmap mode set to interactive viewing
#tmap_options(check.and.fix = TRUE)
map_honeycomb = tm_shape(origin_density_map %>%
                         filter(TOTAL_TRIPS>100000)) +
  tm_fill(
    col = "TOTAL_TRIPS",
    palette = "Reds",
    style = "cont",
    title = "Number of Trips by Origin Bus Stop within the Area",
    alpha = 0.4,
    popup.vars = c("Number of TRIPS: " = "TOTAL_TRIPS"),
    popup.format = list(TOTAL_TRIPS = list(format = "f", digits = 0))) +
  tm_borders(col = "grey40", lwd = 0.7)+
  tm_scale_bar()

map_honeycomb1 = tm_shape(destin_density_map %>%
                         filter(TOTAL_TRIPS>100000))+ 
  tm_fill("TOTAL_TRIPS", 
          style = "cont", 
          palette = "Reds",
          title = "Number of Trips by Destination Bus Stop within the Area",
          alpha = 0.4,
          popup.vars = c("Number of TRIPS: " = "TOTAL_TRIPS"),
          popup.format = list(TOTAL_TRIPS = list(format = "f", digits = 0))) +
  tm_borders(col = "grey40", lwd = 0.7)+
  tm_scale_bar()

tmap_arrange(map_honeycomb, map_honeycomb1, asp=2, ncol=2)

Insights

It is quite surprising to see that there isn’t much alight activity amongst the bus stataion in CBD area, given that this is Weekday morning peak, between 6am to 9am. However, having another thought, this timing is probably more relevant to students going to school (since classes start early from 7 to 8am). Likely the CBD crowd only kicks in from 8-9am, which is significantly lesser in comparison to the whole dataset. We can dive down further (segregate the timing e.g. do one 6-8, one 7-9, to see any difference in results) as the next project.

An interesting sight is that grid_id 5700 has high number of BOTH incoming and outgoing traffic, as reflected by the 301.699 total number of trips as origin area, and 422,497 total number of trips as destination area. This area is likely to be either a bus interchange, or a popular bus stop.

BusStop_Trips1[BusStop_Trips1$ORIGIN_GRID == 5700,]
# A tibble: 529 × 8
# Groups:   ORIGIN_GRID, ORIGIN_BS [3]
   ORIGIN_GRID ORIGIN_BS DESTIN_GRID DESTIN_BS TRIPS FlowNoIntra offset   dist
         <int> <fct>           <int> <chr>     <dbl>       <dbl>  <dbl>  <dbl>
 1        5700 46009            7833 01019        70          70      1 17120.
 2        5700 46009            7930 01059       175         175      1 16759.
 3        5700 46009            7881 01119       190         190      1 16937.
 4        5700 46009            7880 02049        21          21      1 17444.
 5        5700 46009            8024 02089       110         110      1 17910.
 6        5700 46009            8120 02099         1           1      1 18057.
 7        5700 46009            7786 07539        99          99      1 16289.
 8        5700 46009            7882 07589        55          55      1 16434.
 9        5700 46009            6675 14009       183         183      1 19164.
10        5700 46009            6627 14139        16          16      1 18858.
# ℹ 519 more rows

We noted the bus stop is 46009

BusStop[BusStop$BUS_STOP_N == 46009,]
Simple feature collection with 1 feature and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 22836.28 ymin: 46567.63 xmax: 22836.28 ymax: 46567.63
Projected CRS: SVY21 / Singapore TM
     BUS_STOP_N BUS_ROOF_N          LOC_DESC                  geometry
2178      46009        INT WOODLANDS REG INT POINT (22836.28 46567.63)

From the above, we found out that this bus stop is a Interchange station in Woodlands, a densely populated area.

Similarly, we have also noted a few grids where BOTH incoming and outgoing traffic are relatively high. To draw a similar analysis:

list <- intersect(origin_density_map$ORIGIN_GRID[origin_density_map$TOTAL_TRIPS>100000]
        , destin_density_map$DESTIN_GRID[destin_density_map$TOTAL_TRIPS>100000])

So we know that the grid with >100k BOTH incoming and outgoing traffic are:

list
 [1]  2993  4250  4908  5700  7648  7703  8467  9532 10286 10772

Next we find out where are these grid:

BusStop_hexagon[BusStop_hexagon$grid_id %in% list, "LOC_DESC"]
 [1] "TAMPINES INT"                "SERANGOON INT"              
 [3] "TAMPINES INT"                "S'GOON STN EXIT A / BLK 413"
 [5] "WOODLANDS REG INT"           "BOON LAY INT"               
 [7] "S'GOON STN EXIT C / BLK 201" "BEF PUNGGOL RD"             
 [9] "CLEMENTI STN"                "CHOA CHU KANG INT"          
[11] "CLEMENTI STN"                "S'GOON STN EXIT E"          
[13] "LOT 1 / CHOA CHU KANG STN"   "AFT ANG MO KIO STN EXIT A"  
[15] "AFT PUNGGOL RD"              "TOA PAYOH BUS INT"          
[17] "TOA PAYOH BUS INT"           "BEDOK BUS INT"              
[19] "BEDOK BUS INT"               "S'GOON STN EXIT H"          
[21] "ANG MO KIO INT"              "Aft Tampines Stn Exit E"    
[23] "ANG MO KIO STN"              "W'LANDS STN EXIT 4"         
[25] "W'LANDS STN EXIT 5"          "Tampines Stn Exit D"        
[27] "OPP CHOA CHU KANG STN"       "BEDOK STN EXIT A"           

Insights

Noted these are mainly interchange or MRT station so probably make sense as people may change to bus / train at these spot, and there may also be larger number of buses available in these station.

Visualization for the Average distance (weighted) taken at Origin and Destination hexagon area

Now we look at the total average weighted distance traveled.

summary(origin_density_map)
  ORIGIN_GRID     TOTAL_TRIPS      AVERAGE_DIST            geometry   
 Min.   :   52   Min.   :     1   Min.   :  325   POLYGON      :2451  
 1st Qu.: 4688   1st Qu.:   775   1st Qu.: 1517   epsg:3414    :   0  
 Median : 7091   Median :  4603   Median : 2188   +proj=tmer...:   0  
 Mean   : 6762   Mean   : 10579   Mean   : 2426                       
 3rd Qu.: 8908   3rd Qu.: 13280   3rd Qu.: 3051                       
 Max.   :13117   Max.   :365868   Max.   :14379                       
summary(destin_density_map)
  DESTIN_GRID     TOTAL_TRIPS      AVERAGE_DIST            geometry   
 Min.   :   52   Min.   :     1   Min.   :  325   POLYGON      :2456  
 1st Qu.: 4685   1st Qu.:  1543   1st Qu.: 1418   epsg:3414    :   0  
 Median : 7089   Median :  4246   Median : 2099   +proj=tmer...:   0  
 Mean   : 6759   Mean   : 10557   Mean   : 2435                       
 3rd Qu.: 8906   3rd Qu.: 10514   3rd Qu.: 3062                       
 Max.   :13117   Max.   :422497   Max.   :13993                       

We see the mean is around 2500 average distance, and max of around 14,000 distance. As a gauge we will look at data above 8,000 (8km) average distance:

tmap_mode("view")
tmap mode set to interactive viewing
#tmap_options(check.and.fix = TRUE)
map_honeycomb2 = tm_shape(origin_density_map %>%
                         filter(AVERAGE_DIST>8000)) +
  tm_fill(
    col = "AVERAGE_DIST",
    palette = "Reds",
    style = "cont",
    title = "Average Distance by Origin Bus Stop within the Area",
    alpha = 0.4,
    popup.vars = c("Average DISTANCE travelled from this origin bus stop: " = "AVERAGE_DIST"),
    popup.format = list(AVERAGE_DIST = list(format = "f", digits = 0))) +
  tm_borders(col = "grey40", lwd = 0.7)+
  tm_scale_bar()

map_honeycomb3 = tm_shape(destin_density_map %>%
                         filter(AVERAGE_DIST>8000)) +
  tm_fill("AVERAGE_DIST", 
          style = "cont", 
          palette = "Reds",
          title = "Number of Trips by Destination Bus Stop within the Area",
          alpha = 0.4,
          popup.vars = c("Average DISTANCE travelled to this destination bus stop: " = "AVERAGE_DIST"),
          popup.format = list(AVERAGE_DIST = list(format = "f", digits = 0))) +
  tm_borders(col = "grey40", lwd = 0.7)+
  tm_scale_bar()

tmap_arrange(map_honeycomb2, map_honeycomb3, asp=2, ncol=2)
#FOR THOSE >8km AVERAGE TRAVEL DISTANCE WITH THESE ORIGIN BUS STOP
BusStop_hexagon[BusStop_hexagon$grid_id %in% origin_density_map$ORIGIN_GRID[origin_density_map$AVERAGE_DIST>8000], "LOC_DESC"]
 [1] "CHANGI AIRPORT PTB1"     "AFT CHANGI AIRPORT PTB2"
 [3] "AFT LIM CHU KANG LANE 8" "CHANGI AIRPORT PTB3"    
 [5] "CHANGI AIRPORT PTB2"     "PROLOGIS"               
 [7] "DHL"                     "BEF CP E1"              
 [9] "YISHUN DAIRY FARM"       "BEF CP D5"              
[11] "SIGLAP LK"               "Aft Mandai Rd"          
[13] "POLICE COAST GUARD"      "BLK 402"                
[15] "AFT CHANGI FERRY RD"     "OPP CASABLANCA"         
[17] "BEF CHANGI FERRY RD"     "EXPEDITORS"             
[19] "OPP MANDARIN GDNS"       "BAYFRONT STN EXIT B/MBS"
#FOR THOSE >8km AVERAGE TRAVEL DISTANCE WITH THESE DESTINATION BUS STOP
BusStop_hexagon[BusStop_hexagon$grid_id %in% destin_density_map$DESTIN_GRID[destin_density_map$AVERAGE_DIST>8000], "LOC_DESC"]
 [1] "ST ANNE'S CH"              "AIRPORT POLICE STN"       
 [3] "AFT DAIRY FARM RD"         "DHOBY GHAUT STN"          
 [5] "CHANGI AIRPORT PTB1"       "THE SAIL"                 
 [7] "DOWNTOWN STN"              "MARINA BAY FINANCIAL CTR" 
 [9] "AFT LIM CHU KANG LANE 8"   "AFT SLE"                  
[11] "CHANGI AIRPORT PTB3"       "CHANGI AIRPORT PTB2"      
[13] "PROLOGIS"                  "DHL"                      
[15] "TANJONG PAGAR STN EXIT C"  "BEF CHANGI AIRPORT PTB3"  
[17] "MAPLETREE ANSON"           "POLICE COAST GUARD"       
[19] "CHANGI NAVAL BASE"         "OPP CHANGI NAVAL BASE"    
[21] "BEF STEVENS STN EXIT 4"    "AFT STEVENS STN EXIT 5"   
[23] "AFT CHANGI FERRY RD"       "CASABLANCA"               
[25] "LP 94"                     "ORCHARD STN / TANGS PLAZA"
[27] "Marina Bay Stn"            "BEF YISHUN AVE 1"         
[29] "BEF CHANGI FERRY RD"       "OPP DB SCHENKER"          
[31] "OPP ORCHARD STN / ION"     "YMCA"                     
[33] "AFT DHOBY GHAUT STN"       "MENLO WORLDWIDE"          
[35] "EXPEDITORS"                "DHOBY GHAUT STN EXIT B"   
[37] "NEAR SATS FLIGHT KITCHEN" 

Insights

Another interesting observation. For this we will just omit bus stop to and from changi airport, as it can be understood that bus to and from airport are typically of longer distance. The same applies to Changi Naval base/ Opp Changi Naval base as the bus stop is probably serviced by a long distance bus.

The origin bus stops that meet this criteria of >8km average distance are mainly from woodlands/ yishun / causeway. There are also people taking long distance bus and alight at central/ MBFC /Mapletree / CBD /Seletar / Town area, mainly for work.

Visualization of OD FLOW

Creating desire lines

Read the documentation on od2line here

simpleBS_hexagon <- left_join(BusStop_Trips, BusStop_hexagon_sf,
                                 by = c("ORIGIN_GRID" = "grid_id")) %>%
  drop_na() %>%
  group_by(ORIGIN_GRID) %>%
  select(8,1) 

simpleBS_hexagon <- unique(simpleBS_hexagon)
simpleBS_hexagon <- st_sf(geometry = simpleBS_hexagon)
BS_flowLine <- BusStop_Trips1 %>%
  drop_na()  %>%
  group_by(ORIGIN_GRID, DESTIN_GRID) %>% 
  summarize(TOTAL_TRIPS = sum(TRIPS)) %>% 
  filter(TOTAL_TRIPS>1000)
`summarise()` has grouped output by 'ORIGIN_GRID'. You can override using the
`.groups` argument.
BS_flowLine <- unique(BS_flowLine)
flowLine <- od2line(flow = BS_flowLine,                     
                    origin_code = "ORIGIN_GRID",                     
                    dest_code = "DESTIN_GRID",                     
                    zones = BusStop_hexagon_sf,                     
                    zone_code = "grid_id") 
Creating centroids representing desire line start and end points.

Visualising the desire lines

To dive down, we will look at OD visualizatiion for total trips >1,000 during the timeframe.

tmap_options(check.and.fix = TRUE)
tm_shape(MPSZ) +   
  tm_polygons() + tm_fill(alpha = 0.1) +
tm_shape(simpleBS_hexagon) +
  tm_polygons() +
flowLine %>%  
tm_shape() +
  tm_lines(lwd = "TOTAL_TRIPS",
           style = "quantile",
           scale = c(0.1, 1, 3, 5, 7, 10),
           n = 6,
           alpha = 0.3)
Warning: The shape MPSZ is invalid (after reprojection). See sf::st_is_valid
Warning: The shape . is invalid (after reprojection). See sf::st_is_valid
Warning: The shape . contains empty units (after reprojection).
Warning: One tm layer group has duplicated layer types, which are omitted. To
draw multiple layers of the same type, use multiple layer groups (i.e. specify
tm_shape prior to each of them).
Warning in g$scale * (w_legend/maxW): longer object length is not a multiple of
shorter object length
Warning in g$scale * (x/maxW): longer object length is not a multiple of
shorter object length
Legend for line widths not available in view mode.

It appears that the highest number of long OD flow is between north and east, ALso noted that the dense activity areas, per the above analysis where we noted the high total trips per origin/destination, are also closley linked as one of the OD flow. These areas are typically the interchange or major stops.

Next we will look at the total trips >10,000 during the specific period.

tm_shape(MPSZ) +   
  tm_polygons() + 
tm_shape(simpleBS_hexagon) +
  tm_polygons() +
flowLine %>%  
  filter(TOTAL_TRIPS>10000)%>%
tm_shape() +
  tm_lines(lwd = "TOTAL_TRIPS",
           style = "quantile",
           scale = c(0.1, 1, 3, 5, 7, 10),
           n = 6,
           alpha = 0.8)
Warning: The shape MPSZ is invalid (after reprojection). See sf::st_is_valid
Warning in g$scale * (w_legend/maxW): longer object length is not a multiple of
shorter object length
Warning in g$scale * (x/maxW): longer object length is not a multiple of
shorter object length
Legend for line widths not available in view mode.

This is quite interesting, as we see the most obvious OD flow is between the woodland checkpoint to woodland interchange / mrt station area. It is also interesting to see there is long distance OD from woodlands to the major Punggol/seletar area.

tmap_mode("plot")
tmap mode set to plotting

Assembling VARIABLES

Next moving on to the assemblling at least three propulsive and three attractiveness variables by using aspatial and geospatial from publicly available sources.

We will look at the following:

  1. Population of Age 7 to 12

  2. School (Primary School) Types

  3. School distinctive programme

  4. HDB Information - Occupancy rate

  5. MPSZ

pop <- read_csv("data/aspatial/pop.csv")
Rows: 332 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): PA, SZ
dbl (3): AGE7_12, AGE13_24, AGE25_64

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

We will focus on age 7 to 12 therefore will remove the other data set, and remove those zero values.

pop <- pop %>%
  select(1:3) %>%
  filter(AGE7_12>0)

This will gives us the PA, SZ, and Total number of Age7_12.

Next we look into the primary school

school <- read_csv("data/aspatial/Generalinformationofschools.csv")
Rows: 346 Columns: 31
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (31): school_name, url_address, address, postal_code, telephone_no, tele...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Note we dont need so many different columns, and we only want to look at primary school:

school <- school %>%
  filter(mainlevel_code == "PRIMARY") %>%
  select(1,3:4,10:11,19:31)
write_rds(school, "data/rds/PrimarySchool.csv") 

next will look into the vairous school distinctive programme

distinctive <- read_csv("data/aspatial/SchoolDistinctiveProgrammes.csv")
Rows: 288 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): school_name, alp_domain, alp_title, llp_domain1, llp_title1, llp_do...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

we will just look into the alp_domain for all primary school

table(distinctive$alp_domain)

                 Aesthetics Business & Entrepreneurship 
                          7                           5 
                     Coding                  Humanities 
                         11                           3 
                ICT & Media     Innovation & Enterprise 
                          8                           6 
          Interdisciplinary                   Languages 
                         40                          34 
     Languages & Humanities                 Mathematics 
                         19                           2 
                       NULL                     Science 
                         55                          27 
                       STEM 
                         71 

Combining this to our primary school data

primary_school <- left_join(school,distinctive) 
Joining with `by = join_by(school_name)`

There are three school without the corresponding distinctive, this is likely due to the primary school is new, we will confirm on this later. The school database is updated Nov 2023, while the data in distinctive was updated in 2021.

The few distinctive programme will serve as key variables in this dataset.

Next we will go back to pop data set:

pop <- left_join(pop, MPSZ, 
                 by =c("SZ" = "SUBZONE_N"))

We do the same to the primary school

primary_school_MSPZ <- left_join(primary_school, MPSZ, 
                            by =c("dgp_code" = "PLN_AREA_N"))
Warning in left_join(primary_school, MPSZ, by = c(dgp_code = "PLN_AREA_N")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 163 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.

Noted that the data got NA for those dgp_code = SENG KANG, due to discrepencies in format. Therefore we will amend this and rerun

primary_school$dgp_code[primary_school$dgp_code == "SENG KANG"] = "SENGKANG"
primary_school_MSPZ <- left_join(primary_school, MPSZ, 
                            by =c("dgp_code" = "PLN_AREA_N"))%>%
  select(1,4:5,26:27,8,19:23,30)
Warning in left_join(primary_school, MPSZ, by = c(dgp_code = "PLN_AREA_N")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 163 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.

Now we have a primary_school_MSPZ data set that show each primary school, their distinctive programme, and the associated SZ location,and the pop data set that has population aged 7 to 12 by SZ location.

write_rds(primary_school_MSPZ, "data/rds/primary_school_MSPZ.csv") 

We will look into this later.

Preparing Origin and Destination Attributes

Preparing origin attribute

pop <- read_csv("data/aspatial/pop.csv")
Rows: 332 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): PA, SZ
dbl (3): AGE7_12, AGE13_24, AGE25_64

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pop <- pop %>%
  left_join(MPSZ,
            by = c("PA" = "PLN_AREA_N",
                   "SZ" = "SUBZONE_N")) %>%
  select(1:6) %>%
  rename(SZ_NAME = SZ,
         SZ = SUBZONE_C)
mpsz_hexagon <- st_intersection(BusStop_hexagon_sf, MPSZ) %>%
  drop_na()
Warning: attribute variables are assumed to be spatially constant throughout
all geometries
flow_data <- left_join(BusStop_Trips1,mpsz_hexagon, 
                       by =c("ORIGIN_GRID" = "grid_id")) %>%
  select(1:10)%>%
  rename(ORIGIN_SZ = SUBZONE_C,
         ORIGIN_SZ_NAME = SUBZONE_N)
Warning in left_join(BusStop_Trips1, mpsz_hexagon, by = c(ORIGIN_GRID = "grid_id")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 10 of `x` matches multiple rows in `y`.
ℹ Row 11299 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
flow_data <- unique(flow_data)
flow_data <- left_join(flow_data,mpsz_hexagon, 
                       by =c("DESTIN_GRID" = "grid_id")) %>%
  select(1:12)%>%
  rename(DESTIN_SZ = SUBZONE_C,
         DESTIN_SZ_NAME = SUBZONE_N)
Warning in left_join(flow_data, mpsz_hexagon, by = c(DESTIN_GRID = "grid_id")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 2437 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
flow_data <- unique(flow_data)
flow_data1 <- flow_data %>%
  left_join(pop,
            by = c(ORIGIN_SZ = "SZ")) %>%
  rename(ORIGIN_AGE7_12 = AGE7_12,
         ORIGIN_AGE13_24 = AGE13_24,
         ORIGIN_AGE25_64 = AGE25_64) %>%
  select(-c(PA, SZ_NAME))
flow_data1 <- flow_data1 %>%
  left_join(pop,
            by = c(DESTIN_SZ = "SZ")) %>%
  rename(DESTIN_AGE7_12 = AGE7_12,
         DESTIN_AGE13_24 = AGE13_24,
         DESTIN_AGE25_64 = AGE25_64) %>%
  select(-c(PA, SZ_NAME))
write_rds(flow_data1, "data/rds/SIM_data")

Calibrating Spatial Interaction Models

Visualising the dependent variable (TRIPS)

ggplot(data = flow_data1,
       aes(x = TRIPS)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Notice that the distribution is highly skewed and not resemble bell shape or also known as normal distribution.

Next, let us visualise the relation between the dependent variable and one of the key independent variable in Spatial Interaction Model, namely distance.

ggplot(data = flow_data1,        
       aes(x = dist,            
           y = TRIPS)) +   
  geom_point() +   
  geom_smooth(method = lm)
`geom_smooth()` using formula = 'y ~ x'

Notice that their relationship hardly resemble linear relationship.

On the other hand, if we plot the scatter plot by using the log transformed version of both variables, we can see that their relationship is more resemble linear relationship.

ggplot(data = flow_data1,        
       aes(x = log(dist),            
           y = log(TRIPS))) +   
  geom_point() +   
  geom_smooth(method = lm)
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 4226 rows containing non-finite values (`stat_smooth()`).

Checking for variables with zero values

Since Poisson Regression is based of log and log 0 is undefined, it is important for us to ensure that no 0 values in the explanatory variables.

In the code chunk below, summary() of Base R is used to compute the summary statistics of all variables:

summary(flow_data1)
  ORIGIN_GRID      ORIGIN_BS       DESTIN_GRID     DESTIN_BS        
 Min.   :   52   54009  :  3448   Min.   :   52   Length:985230     
 1st Qu.: 5916   84009  :  3132   1st Qu.: 6056   Class :character  
 Median : 7473   46009  :  2565   Median : 7494   Mode  :character  
 Mean   : 7293   17009  :  2349   Mean   : 7287                     
 3rd Qu.: 8651   60179  :  2345   3rd Qu.: 8517                     
 Max.   :13117   22009  :  2242   Max.   :13117                     
                 (Other):969149                                     
     TRIPS           FlowNoIntra           offset              dist      
 Min.   :    1.00   Min.   :    0.00   Min.   :0.000001   Min.   :    0  
 1st Qu.:    3.00   1st Qu.:    3.00   1st Qu.:1.000000   1st Qu.: 2030  
 Median :   14.00   Median :   14.00   Median :1.000000   Median : 3994  
 Mean   :   97.22   Mean   :   97.05   Mean   :0.995711   Mean   : 5027  
 3rd Qu.:   53.00   3rd Qu.:   53.00   3rd Qu.:1.000000   3rd Qu.: 7083  
 Max.   :96630.00   Max.   :96630.00   Max.   :1.000000   Max.   :24758  
                                                                         
 ORIGIN_SZ_NAME      ORIGIN_SZ         DESTIN_SZ_NAME      DESTIN_SZ        
 Length:985230      Length:985230      Length:985230      Length:985230     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 ORIGIN_AGE7_12 ORIGIN_AGE13_24 ORIGIN_AGE25_64 DESTIN_AGE7_12  
 Min.   :   0   Min.   :    0   Min.   :    0   Min.   :   0.0  
 1st Qu.: 180   1st Qu.:  410   1st Qu.: 1860   1st Qu.:  90.0  
 Median : 730   Median : 1580   Median : 7370   Median : 620.0  
 Mean   :1088   Mean   : 2434   Mean   :11264   Mean   : 986.4  
 3rd Qu.:1610   3rd Qu.: 3470   3rd Qu.:16780   3rd Qu.:1480.0  
 Max.   :6340   Max.   :16380   Max.   :74610   Max.   :6340.0  
 NA's   :27     NA's   :27      NA's   :27      NA's   :14      
 DESTIN_AGE13_24 DESTIN_AGE25_64
 Min.   :    0   Min.   :    0  
 1st Qu.:  170   1st Qu.:  900  
 Median : 1300   Median : 6190  
 Mean   : 2200   Mean   :10207  
 3rd Qu.: 3290   3rd Qu.:15830  
 Max.   :16380   Max.   :74610  
 NA's   :14      NA's   :14     

The print report above reveals that variables ORIGIN_AGE7_12, ORIGIN_AGE13_24, ORIGIN_AGE25_64,DESTIN_AGE7_12, DESTIN_AGE13_24, DESTIN_AGE25_64 consist of 0 values.

flow_data1$DESTIN_AGE7_12 <- ifelse(
  flow_data1$DESTIN_AGE7_12 == 0,
  0.99, flow_data1$DESTIN_AGE7_12)
flow_data1$DESTIN_AGE13_24 <- ifelse(
  flow_data1$DESTIN_AGE13_24 == 0,
  0.99, flow_data1$DESTIN_AGE13_24)
flow_data1$DESTIN_AGE25_64 <- ifelse(
  flow_data1$DESTIN_AGE25_64 == 0,
  0.99, flow_data1$DESTIN_AGE25_64)
flow_data1$ORIGIN_AGE7_12 <- ifelse(
  flow_data1$ORIGIN_AGE7_12 == 0,
  0.99, flow_data1$ORIGIN_AGE7_12)
flow_data1$ORIGIN_AGE13_24 <- ifelse(
  flow_data1$ORIGIN_AGE13_24 == 0,
  0.99, flow_data1$ORIGIN_AGE13_24)
flow_data1$ORIGIN_AGE25_64 <- ifelse(
  flow_data1$ORIGIN_AGE25_64 == 0,
  0.99, flow_data1$ORIGIN_AGE25_64)
summary(flow_data1)
  ORIGIN_GRID      ORIGIN_BS       DESTIN_GRID     DESTIN_BS        
 Min.   :   52   54009  :  3448   Min.   :   52   Length:985230     
 1st Qu.: 5916   84009  :  3132   1st Qu.: 6056   Class :character  
 Median : 7473   46009  :  2565   Median : 7494   Mode  :character  
 Mean   : 7293   17009  :  2349   Mean   : 7287                     
 3rd Qu.: 8651   60179  :  2345   3rd Qu.: 8517                     
 Max.   :13117   22009  :  2242   Max.   :13117                     
                 (Other):969149                                     
     TRIPS           FlowNoIntra           offset              dist      
 Min.   :    1.00   Min.   :    0.00   Min.   :0.000001   Min.   :    0  
 1st Qu.:    3.00   1st Qu.:    3.00   1st Qu.:1.000000   1st Qu.: 2030  
 Median :   14.00   Median :   14.00   Median :1.000000   Median : 3994  
 Mean   :   97.22   Mean   :   97.05   Mean   :0.995711   Mean   : 5027  
 3rd Qu.:   53.00   3rd Qu.:   53.00   3rd Qu.:1.000000   3rd Qu.: 7083  
 Max.   :96630.00   Max.   :96630.00   Max.   :1.000000   Max.   :24758  
                                                                         
 ORIGIN_SZ_NAME      ORIGIN_SZ         DESTIN_SZ_NAME      DESTIN_SZ        
 Length:985230      Length:985230      Length:985230      Length:985230     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 ORIGIN_AGE7_12    ORIGIN_AGE13_24    ORIGIN_AGE25_64    DESTIN_AGE7_12   
 Min.   :   0.99   Min.   :    0.99   Min.   :    0.99   Min.   :   0.99  
 1st Qu.: 180.00   1st Qu.:  410.00   1st Qu.: 1860.00   1st Qu.:  90.00  
 Median : 730.00   Median : 1580.00   Median : 7370.00   Median : 620.00  
 Mean   :1088.08   Mean   : 2434.32   Mean   :11264.18   Mean   : 986.60  
 3rd Qu.:1610.00   3rd Qu.: 3470.00   3rd Qu.:16780.00   3rd Qu.:1480.00  
 Max.   :6340.00   Max.   :16380.00   Max.   :74610.00   Max.   :6340.00  
 NA's   :27        NA's   :27         NA's   :27         NA's   :14       
 DESTIN_AGE13_24    DESTIN_AGE25_64   
 Min.   :    0.99   Min.   :    0.99  
 1st Qu.:  170.00   1st Qu.:  900.00  
 Median : 1300.00   Median : 6190.00  
 Mean   : 2200.29   Mean   :10207.40  
 3rd Qu.: 3290.00   3rd Qu.:15830.00  
 Max.   :16380.00   Max.   :74610.00  
 NA's   :14         NA's   :14        

Origin (Production) constrained SIM

#any(is.infinite(flow_data1$TRIPS))
#orcSIM <- glm(formula = TRIPS ~ 
#                 ORIGIN_SZ +
#                 log(DESTIN_AGE7_12) +
#                 log(dist),
#              family = poisson(link = "log"),
#              data = flow_data1,
#              na.action = na.exclude)
#summary(orcSIM)

Additional - Train Station

TrainStation <- st_read(dsn="data/geospatial",                   
                        layer="RapidTransitSystemStation")%>%   
  st_transform(crs = 3414)
Reading layer `RapidTransitSystemStation' from data source 
  `C:\cftoh\ISSS624\Take-home_Ex\Take-home_Ex2\data\geospatial' 
  using driver `ESRI Shapefile'
Warning in CPL_read_ogr(dsn, layer, query, as.character(options), quiet, : GDAL
Message 1: Non closed ring detected. To avoid accepting it, set the
OGR_GEOMETRY_ACCEPT_UNCLOSED_RING configuration option to NO
Simple feature collection with 220 features and 4 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 6068.209 ymin: 27478.44 xmax: 45377.5 ymax: 47913.58
Projected CRS: SVY21

I see that TrainStation is a Polygon geometry SF with “TYP_CD”, “STN_NAM”, “TYP_CD_DES”, “STN_NAM_DE”, and its geometry.

Note that “TYP_CD” is of all 0 value, “STN_NAM” is of all NA value, while “TYP_CD_DES” value is “MRT” or “LRT”.

Also note that the above shows a warning message:

Warning: GDAL Message 1: Non closed ring detected. To avoid accepting it, set the OGR_GEOMETRY_ACCEPT_UNCLOSED_RING configuration option to NO
invalid_geoms <- TrainStation[!st_is_valid(TrainStation), ]

It appears to have 3 invalid geoms, apart form the NA entry, we can see that the both “HARBOURFRONT MRT STATION” and “UPPER THOMSON MRT STATION” are the invalid geometries.

For the purpose of this exercise, I will filter out these entries and retain on the valid geometries in TrainStation.

TrainStation <- TrainStation[st_is_valid(TrainStation), ]%>%   
  st_transform(crs = 3414)
TrainStation_Exit <- st_read(dsn="data/geospatial",                 
                             layer="Train_Station_Exit_Layer")%>%   
  st_transform(crs = 3414)
Reading layer `Train_Station_Exit_Layer' from data source 
  `C:\cftoh\ISSS624\Take-home_Ex\Take-home_Ex2\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 565 features and 2 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 6134.086 ymin: 27499.7 xmax: 45356.36 ymax: 47865.92
Projected CRS: SVY21

I see that TrainStation_Exit is a Point geometry SF with “stn_name”, “exit_code”, and its geometry.

I get aspatial Data of PASSENGER VOLUME BY ORIGIN DESTINATION TRAIN STATIONS, downloaded via API (postman GET) from Data Mall LTA. For the purpose of this exercise the Aug 2023 Data will be used.

OD_train <- read_csv("data/aspatial/origin_destination_train_202308.csv") #non-spatial data with no geometry features
Rows: 806375 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): YEAR_MONTH, DAY_TYPE, PT_TYPE, ORIGIN_PT_CODE, DESTINATION_PT_CODE
dbl (2): TIME_PER_HOUR, TOTAL_TRIPS

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
OD_train$ORIGIN_PT_CODE <- as.factor(OD_train$ORIGIN_PT_CODE)
OD_train$DESTINATION_PT_CODE <- as.factor(OD_train$DESTINATION_PT_CODE)

Noted that both the aspatial data, for bus and train, are similar where there are total of 7 columns YEAR_MONTH, DAY_TYPE, PT_TYPE, ORIGIN_PT_CODE, DESTINATION_PT_CODE, TIME_PER_HOUR, TOTAL_TRIPS.

We will now first combine the SF data sources relating to Bus Stop, using st_intersection():

TrainStation_combined <- st_intersection(TrainStation, MPSZ)
Warning: attribute variables are assumed to be spatially constant throughout
all geometries

Before proceeding it will be good practice to check the crs of each using st_crs.

We will now first combine the SF data sources relating to Train Station, using st_intersection():

TrainStation_combined <- st_intersection(TrainStation_combined, TrainStation_Exit) %>%
  mutate(STN_NAM_DE = str_replace(STN_NAM_DE, "MRT STATION", "")) %>%
  mutate(STN_NAM_DE = str_replace(STN_NAM_DE, "LRT STATION", ""))%>%
  select(3:13) %>%
  mutate(STN_NAM_DE = str_trim(STN_NAM_DE))
Warning: attribute variables are assumed to be spatially constant throughout
all geometries

Note this data frame does not include the two MRT stations we filtered out earlier.

Note that we have removed the first two column (NA and 0 values), in order to associate the train station code, we will import the following data set and join them:

TrainStation_Code <- readxl::read_excel("data/aspatial/Train Station Codes and Chinese Names.xls")

TrainStation_Code$mrt_station_english <- toupper(TrainStation_Code$mrt_station_english)
TrainStation_combined <- left_join(TrainStation_combined, 
                              TrainStation_Code, 
                              by = c("STN_NAM_DE" = "mrt_station_english"))
Warning in sf_column %in% names(g): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 88 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.

For reproducibility, will save into a rds file:

write_rds(TrainStation_combined, "data/rds/TrainStation_combined.csv")  

Next, we are going to append the planning subzone code from TrainStation_combined data frame onto OD_train data frame.

Origin_TrainStation <- left_join(OD_train , TrainStation_combined,
            by = c("ORIGIN_PT_CODE" = "stn_code")) %>%
  select(1:11,16) 
Warning in left_join(OD_train, TrainStation_combined, by = c(ORIGIN_PT_CODE = "stn_code")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 175 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.

Looking at the results, noted there are a number of mismatches due to the origin PT code having multiple values. We will need to look into this and clean up the dataset for better visualiztion.

OD_train_filter <- OD_train %>%
  mutate(ORIGIN_PT_CODE = str_replace(ORIGIN_PT_CODE, "/.*", "")) %>%
  mutate(DESTINATION_PT_CODE = str_replace(DESTINATION_PT_CODE, "/.*", ""))

This removes the additional train codes with multiple lines (e.g. Botanic Garden is both Circle line and Downtown line), and only retain one, so to avoid double counting. Then we run the left_join using this dataframe:

Origin_TrainStation_filter <- left_join(OD_train_filter , TrainStation_combined,
            by = c("ORIGIN_PT_CODE" = "stn_code")) %>%
  select(1:11,16) 
Warning in left_join(OD_train_filter, TrainStation_combined, by = c(ORIGIN_PT_CODE = "stn_code")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 175 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.

Own Notes

Constructing an O/D Matrix

  • In O/D matrix, the Ti sum of row representing total output of origin location, while sum of column represents input of destination.

  • Can have sub-matrix

  • Additional new activity may change this to new structure

  • O/D Matrix can be costly, 322 x 322 = 110k O/D pairs, which each info need to be carefully provided (but may also change).

  • GPS/ Smart card can collect personal info to represent flows between locations

There are 3 Spatial Interaction Models (First 2 steps)

  • Basic assumption is that function of attributes of location of origin and destination and frictions

  • Potential Model is usually for measuring accessibility

  • Retail model usually used for franchise to choose service area of store of delivery segment

  • This course we will focus on Gravity Model which is the most common model. It uses Newton first law of gravity. Estimated Tij is transition/trip or flow between origin i (row) and destination j(columns). Parameters V, W, d, k lamda, alpha and beta. Beta is always assumed to be negative as increase in cost.distance will likely decrease the interaction.

  • A family of gravity Models - Unconstrained (totally constrained), origin constrained, destination constrained, doubly constrained.

We can see that sp is in list, no geometric column, they are segregated as different table inside object, In tidyverse it is in a whole table. But to call a field in sp, we will need to write something like below

  select(mpsz@data$SUBZONE)